Installation

Installing R

Installing R on your machine is straightforward. Follow these steps:

  1. Go to the CRAN (Comprehensive R Archive Network) R website (http://cran.r-project.org). If you type ‘r’ into Google it is the first entry
  2. Choose to download R for Linux, Mac or Windows
  3. For Windows users, just choose ‘base’, which will link you to the download file
  4. For Mac users, choose the version relevant to your Operating System
  5. If you are a Linux user, you know what to do!

Installing RStudio

We will use RStudio in this Course. RStudio is a free front-end to R for Windows, Mac or Linux (i.e., R is working in the background). It makes working with R easier, more productive, and more organised, especially for new users. There are other front-ends, but RStudio is the most popular. To install:

  1. Go to the RStudio website (http://www.rstudio.com)
  2. Choose the Download RStudio button
  3. Choose run RStudio on your Desktop and follow the prompts
  4. Choose the relevant ‘Installers for ALL Platforms’ to download

Using RStudio: A quick guide!

RStudio has four main panes, each in a quadrant of your screen. You can set what appears in each (through the Tools/Options menu in Windows or RStudio/Preferences on a Mac), but the default has:

  • Console (bottom left)
  • Source Editor (top left)
  • Environment and History (top right)
  • Plots, Files, Packages, Help, Viewer (bottom right)

We will discuss the main ones in turn.

Console

This is where you can type code that executes immediately. It is also known as the command line. Entering code in the command line is intuitive and easy. For example, we can use R as a calculator by typing into the Console (and pressing Enter after each line):

6 + 3
## [1] 9
2 ^ 4
## [1] 16

TIP: Type it in! Although you could simply copy code from these PDF notes into the Console, you really shouldn’t. The first reason is that you might unwittingly copy invisible formatting errors into R, which will make the code fail. But more importantly, typing code in yourself gives you the practice you need, and allows you to make (and correct) your own errors. This is an invaluable way of learning.

Spaces are optional around simple calculations. We can also use the assignment operator <- (read as ‘gets’ or ‘is assigned’) to assign any calculation to a variable so that we can access it later (the = sign would work, too, but it’s bad practice to use it… we’ll talk about this as we go):

a <- 2
b<-7
a + b
## [1] 9

Spaces are also optional around assignment operators. It is good practice to use single spaces in your R scripts, though. This makes them more readable to the human eye and also to the machine. Try this:

d<-2
d < -2
## [1] FALSE

Note that the first line of code assigns d a value of 2. But the second asks R whether this variable has a value less than -2. It responds with FALSE. The clearest way of specifying this is:

d <- 2

Challenge: Is R case sensitive when assigning variables? Is A the same as a? Check this out yourself.

We can also assign a vector by using the combine (c) function:

apples <- c(5.3, 3.8, 4.5)

A vector is a one-dimensional array (i.e., a list of numbers), and this is the simplest form of data used in R (you can think of a single value in R as just a very short vector!). We’ll talk about more complex (and therefore powerful) forms of data structure as we go along.

If you want to display the value of apples, then type:

apples
## [1] 5.3 3.8 4.5
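To make the idea of a vector a little more concrete, here is a minimal sketch of what you can do with one. It shows that arithmetic is applied to every element at once, and that a single value really is a vector of length one. The functions length and is.vector are part of base R but have not been introduced in these notes, and the variable single is just an illustrative name.

```r
apples <- c(5.3, 3.8, 4.5)

# Arithmetic is applied to every element of the vector at once
apples * 2
## [1] 10.6  7.6  9.0

# length() reports how many elements a vector holds
length(apples)
## [1] 3

# A single value is simply a vector of length one
single <- 7
is.vector(single)
## [1] TRUE
length(single)
## [1] 1
```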

TIP: Variable names It is best not to use c as the name of a value or array. Why? What other words might not be good to use?

RStudio supports the automatic completion of code using the TAB key – if there is more than one variable or function that begins with the letters you type it will provide a list that you can scroll through and select from. For example, type the following and then the TAB key:

app

And select the first item from the list and enter.

The code-completion feature also provides brief inline help for functions whenever possible. For example, type the following and press the TAB key:

mean

Other ways to get help in R from the Console include:

?mean

OR

help(mean)
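If you cannot remember the exact name of a function, or only want a quick reminder of its arguments, a few other base R functions may help. This is only a sketch of some options (apropos, args and example are not covered elsewhere in these notes):

```r
# List every object R can currently see whose name contains "mean"
apropos("mean")

# Show the arguments a function accepts, without opening its help page
args(mean)

# Run the examples given at the bottom of a function's help page
example(mean)
```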

The RStudio Console automatically maintains a “history” so that you can retrieve previous commands, a bit like your Internet browser. In the Console, press the up arrow, and see what happens.

If you wish to review a list of your recent commands and then select a command from this list you can use Ctrl+Up to review the list (Cmd+Up on the Mac).

The Console title bar has a few useful features:

  1. It displays the current R working directory (more on this in the next section)
  2. It provides the ability to interrupt R during a long computation (a red stop sign will appear whilst code is running); alternatively click in the console and hit the Esc key
  3. It allows you to minimize and maximize the Console in relation to the Source pane (using the buttons at the top-right or by double-clicking the title bar)

Source Editor

It is tedious to type code one line at a time at the Console, and you could lose all your work if the machine crashed. It is better to write a script (you could also call this a “program”, if you like). A script is a collection of code that you execute sequentially to perform a task. You can save the script and bring it up at any time in the future. The Source Editor is where you open, edit and execute your scripts.

Let us open a script. There are two main ways to do this:

  1. In the menu, go File/Open and navigate to the file you want. Open the file SimpleScatter.r and it will appear in the Source editor
  2. Use Windows Explorer (Finder on Mac) and navigate to the file SimpleScatter.r. This is a text file with code in it that has been given a .r extension. Now make RStudio the default application for opening .r files (right click on the file name and set RStudio as the default application). Now double click on the file – this will open it in RStudio and bring it into the Source editor. You can see that the title bar in the Console has the working directory listed as the folder that the file was opened from.

Note that .r files are simply standard text files, so they can be created in any text editor and saved with a .r extension, but the Source editor has the advantage of providing syntax highlighting, code completion, and smart indentation. You can see the different colours for numbers and there is also highlighting to help you count brackets (put your cursor insertion point before a bracket and push the right arrow and you will see its partner bracket highlighted). We can execute R code directly from the Source Editor. Try the following:

  1. To execute a single line, press Ctrl+Enter (on Windows machines) or Cmd+Enter (on Macs). Try running each line one at a time
  2. If you want to execute the whole script, select the Source icon or highlight all the lines of code and use Ctrl+Enter (or Cmd+Enter)

We will use the Source Editor extensively in this workshop.

Environment, History panes

The Environment is very useful as it shows you what objects (i.e., data.frames, arrays, values and functions) you have defined. You can see the values for objects with a single value and for those that are longer, R tells you their class (i.e., the sort of data object they represent). You can even click on individual objects, and their values will appear in the Source Editor.

TIP: Click on the object z in the Environment. The object will appear as a matrix. This trick is very useful for many kinds of R objects.

Alongside the Environment tab is the History tab, where you can see all the executed code for the session. If you highlight a line, you can send it To Console or to the Source Editor (script) to save it for later.

Files, Plots, Packages, Help, Viewer panes

The last pane has several different tabs. The Files tab has a navigable file manager. The Plots tab is where graphics appear. You will have noticed that the plot from the script SimpleScatter.r appears in the Plots tab. The Packages tab shows you the packages that are installed and those that can be installed (more on this soon). The Help tab allows you to search the R documentation for help (box in top right) and is where the help appears when you ask for it from the Console.

Getting data into R

We will now see how easy it is to read data into R. R will read in many types of data, including spreadsheets, text files, binary files, and files from other statistical packages.

Data format

Preparing data

For R to be able to analyse your data, it needs to be in a consistent format, with each variable in columns and the samples (observations) in rows. The format within each variable (column) needs to be consistent and is commonly one of the following types:

  • a continuous numeric variable (e.g., fish length (say in m): 0.133, 0.145);
  • an integer variable (e.g., species richness, or counts);
  • a factor or categorical variable (e.g., Month: Jan, Feb or 1, 2,…, 12);
  • a nominal variable (e.g., algal colour: red, green, brown), which is also called a factor in R; or
  • a logical variable (i.e., TRUE or FALSE).

You can also use other more specific formats such as dates and times, and more general text formats.
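As a rough sketch of how these types look inside R, the lines below build one small vector of each kind and ask R for its class. All of the values and variable names are made up for illustration; month.abb is a constant built into R that holds the abbreviated month names.

```r
# Hypothetical values, chosen only to illustrate each type
fish_length  <- c(0.133, 0.145)            # continuous numeric
richness     <- c(12L, 7L)                 # integer (the L suffix marks an integer)
month        <- factor(c("Jan", "Feb"),
                       levels = month.abb) # factor with levels in calendar order
algal_colour <- factor(c("red", "green", "brown"))  # nominal factor
is_adult     <- c(TRUE, FALSE)             # logical

class(fish_length)
## [1] "numeric"
class(richness)
## [1] "integer"
class(month)
## [1] "factor"
class(is_adult)
## [1] "logical"

# Without explicit levels, R sorts factor levels alphabetically
levels(algal_colour)
```

Setting the levels explicitly for month keeps the months in calendar order rather than alphabetical order, which matters later when you plot or tabulate by month.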

Naming variables

R has pedantic requirements for naming variables. It is safest to not use spaces, special characters (e.g., commas, semicolons, any of the shift characters above the numbers), or function names (e.g., mean).
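One way to see what R considers an acceptable name is the base function make.names, which converts arbitrary text into syntactically valid names. The sketch below feeds it some hypothetical column headings of the kind that cause trouble; read.csv applies the same conversion to your column headers by default (through its check.names argument).

```r
# Hypothetical column headings: a space, a comma, a leading digit,
# and a word that R reserves for its own use
headings <- c("flush dist", "land,dist", "2nd.site", "if")

fixed <- make.names(headings)
fixed
# The space and the comma each become a full stop, the name starting
# with a digit gains a leading X, and the reserved word gains a
# trailing full stop:
#   "flush.dist"  "land.dist"  "X2nd.site"  "if."
```

It is usually cleaner to fix the names in your spreadsheet before exporting than to rely on R to repair them for you.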

data.frames

Generally, the best way to store your data is to put all your biological and environmental data into a single data structure called a data.frame so that you can analyse them together. This means having a single row per observation, with the first few variables (columns) of each row being descriptors of each of your samples (e.g., Date, Latitude, Longitude, Site, etc.), and the last variable (column) being the observation itself. Each column has the same number of rows, so that it resembles a matrix. The difference between a matrix and a data.frame in R is that all values in a matrix must be of the same class (i.e., all numeric or all character), whereas in a data.frame, variables can be of different classes, which is extremely useful in ecology. In essence, each row contains a data point (observation; this will often reflect the response variable in your analysis), plus as many descriptors for that data point as are available (these are generally the explanatory variables in an analysis). The first few pages of Chapter 3 of The R Book explain this nicely, but for details, see the example below…

[Image: example of data laid out as a data.frame]
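To see the difference between a data.frame and a matrix for yourself, here is a minimal sketch. The three observations are invented for illustration, with column names borrowed from the BeachBirds data used later; sapply and cbind are base R functions not covered in these notes.

```r
# Hypothetical observations, laid out with one row per bird
birds <- data.frame(
  Site       = c(1L, 1L, 2L),
  Species    = c("Gull", "Oystercatcher", "Gull"),
  flush.dist = c(11.5, 12.4, 9.8)
)

# Each column of a data.frame keeps its own class
col_classes <- sapply(birds, class)
col_classes

# A matrix can hold only one class, so mixing numbers and text
# silently converts everything to character
m <- cbind(c(1, 2), c("a", "b"))
class(m[1, 1])
## [1] "character"
```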

Bringing in the pre-prepared data

We are going to read in the BeachBirds.csv dataset provided (see above). These data reflect results of an experiment on beaches designed to measure the influence of off-road vehicles (ORVs) on shorebirds. We visited five different beaches on the Sunshine Coast (Site), and at each site, drove along the shoreline in an ORV. As we drove along, we identified birds in the distance, and drove at them until they took flight. We recorded the species (Species) and sex (Sex) of the bird, the distance from the bird at which it took flight (flush.dist), as well as the distance the bird flew before settling again (land.dist). In instances where sex could not be determined, or where birds flew out of sight before landing, we marked observations NA.

The first task is to convert the Excel file supplied into a .csv file, which is our recommended format for getting data into R. Open BeachBirds.xlsx in Excel, then select “Save As” from the File menu. In the Format drop-down menu, select the option called Comma Separated Values, then hit Save. You’ll get a warning that formatting will be removed and that only one sheet will be exported; simply Continue. Your working directory should now contain a file called BeachBirds.csv.

Now let’s start writing a script by clicking on the New Document button (in the top left) and selecting R Script.

It is recommended to start a script with some basic information for you to refer back to later. Start with a comment line (the line begins with a #) that tells you the name of the script, something about the script, who created it, and the date it was created. In the source editor enter:

# BeachBirds.R. Reads in and manipulates bird data
# <YourName> <CurrentDate>

TIP: # Comments The hash (#) tells R not to run any of the text on that line to the right of the symbol. This is the standard way of commenting R code; it is VERY good practice to comment in detail so that you can understand later what you have done. Note that you can comment out entire blocks of code by highlighting them in the Source Editor, going to the Code menu and choosing Comment/Uncomment Lines.

Setting the working directory

An important aspect of any script is its working directory. This is where R will read and write files. RStudio displays the current working directory within the title region of the Console.

There are a number of ways to change the current working directory:

  1. Use the setwd function
  2. From within the Files pane, navigate to the directory you want to set as the working directory and then select the Files/More/Set As Working Directory menu item (navigation within the Files pane alone will not change the working directory)
  3. If you double-click on a .r file then it will set the working directory as the directory it is located in

Now we are going to use RStudio to set the working directory initially and then copy the code it produces into our script. This ensures that the location and syntax are correct. Note that Windows itself separates the folders in a path with \, whereas Mac uses /. Inside R code, however, a single \ is an escape character, so on Windows machines the path must be written with / or with doubled backslashes (\\).

In the Files tab, use the directory structure to navigate to the RWorkshop directory where you have saved the files. Then select More, and a menu will drop down. Select Set As Working Directory. This means that whenever you read or write a file then it will always be working in that directory. Now, see the code it executed in the Console – copy it into your script you are writing in the Source editor. It will be different for you, but mine is:

setwd("/Users/davidschoeman/Dropbox/Documents/USC Teaching/ANM203 - S2 - 2016/Code_and_Data") # NOTE that your path will be different
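If you would like to experiment with setwd and getwd without needing a particular folder on your machine, here is a small sketch. It uses tempdir() (a scratch folder R creates for every session) as a stand-in for your course folder, and file.path() to build a path from its parts. The Windows paths in the comments are hypothetical examples only.

```r
# Remember where we started so that we can return afterwards
start_dir <- getwd()

# tempdir() always exists, so it is a safe stand-in for your course folder
setwd(tempdir())
getwd()

# file.path() joins folder names with forward slashes on every platform
file.path("RWorkshop", "Code_and_Data")
## [1] "RWorkshop/Code_and_Data"

# On Windows, write paths in R with forward slashes or doubled backslashes
# (hypothetical locations):
#   setwd("C:/RWorkshop/Code_and_Data")
#   setwd("C:\\RWorkshop\\Code_and_Data")

# Go back to the original working directory
setwd(start_dir)
```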

TIP: Splitting lines of code If you have long lines of code, you can spread them over multiple lines. You just have to make sure that R knows something is coming, either by leaving a bracket open, or by ending the line with an operator (e.g., +, -, ,, etc.).
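Here is a short sketch of splitting code over several lines, along with the most common way it goes wrong. The variable names total, pears and short are made up for this illustration.

```r
# Ending a line with an operator tells R that the expression continues
total <- 1 + 2 +
  3 + 4
total
## [1] 10

# An open bracket does the same job
pears <- c(2.1,
           3.4,
           1.9)
length(pears)
## [1] 3

# Pitfall: if the operator starts the NEXT line, R treats the first
# line as complete and runs the second line as a separate calculation
short <- 1 + 2
  + 3 + 4
short
## [1] 3
```

In the last case R raises no error, which is what makes this mistake easy to miss: the second line is simply evaluated on its own and its result is never stored.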

So now we can save our script. Choose File/Save As/ and type in BeachBirds. It will automatically add a .r extension. But where will it save it? Yes, that’s right, the Working Directory.

TIP: Organising R projects For every R project, we have a separate directory that includes the data files and outputs. If you want to get the current Working Directory then type: getwd()

Importing data

Now we have the working directory set, R will know where to look for the files we read. The function read.csv is the most convenient way to read in most biological data. There are several other ways to read in data, but .csv is usually the easiest. To find out what it does, we will read its help entry:

?read.csv

All R Help items are in the same format: a short Description (of what it does), Usage, Arguments (the different inputs it requires), Details (of what it does), Value (what it returns) and Examples. Arguments (the parameters that are passed to the function) are the lifeblood of any function, as this is how you provide information to R. You do not need to specify all arguments, as many have appropriate default values, and others might not be needed for your particular case. You will learn more about this as we go along.

There are many arguments that you can use to customize the way that your data are read, but most important are:

  1. file: the name of the data file to be read (this needs to include its path if it is not in your specified working directory); note that file names must be placed within quotation marks
  2. header: is a logical argument (TRUE/FALSE) that specifies whether R reads the first line of your file as the names of the variables it contains
  3. quote: the character(s) used to quote character strings. For read.csv the default is the double quote " (the more general read.table also accepts the single quote ' by default); this usually does not need to be changed when reading data exported as .csv from Excel.

Let’s assign the data in the file BeachBirds.csv to a variable called dat:

dat <- read.csv("BeachBirds.csv", header = TRUE)

Remember that specifying header = TRUE indicates to R that the first row in the spreadsheet contains variable (column) names (headers).
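To see what the header argument actually does, you can try the sketch below, which does not depend on the course files. It writes a tiny, made-up .csv file to a temporary location (tempfile and writeLines are base R functions not covered in these notes) and reads it back both ways.

```r
# Write a small, hypothetical .csv file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Site,Species,flush.dist",
             "1,Gull,11.5",
             "2,Oystercatcher,12.4"), tmp)

# header = TRUE: the first line supplies the variable names
with_header <- read.csv(tmp, header = TRUE)
names(with_header)   # "Site" "Species" "flush.dist"
nrow(with_header)
## [1] 2

# header = FALSE: the first line is treated as data, and R invents names
no_header <- read.csv(tmp, header = FALSE)
names(no_header)     # "V1" "V2" "V3"
nrow(no_header)
## [1] 3
```

Notice that with header = FALSE the column names end up mixed in with the data, so every column is read as text. This is a common cause of numbers unexpectedly behaving like characters.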

Note that we called the data.frame dat, but we could equally have called it just about anything else. We selected dat because it is short and easy to type. Statisticians are lazy that way.

TIP: Importing different format data read.csv is simply a “wrapper” for a more general function called read.table (i.e., it calls read.table with default settings suited to .csv files), which itself allows you to read in many types of files. To find out more, type: ?read.table
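The relationship between the two functions can be sketched as follows. The temporary files and their contents are made up for illustration; the argument values passed to read.table are the defaults listed on the read.csv help page.

```r
# A small, hypothetical comma-separated file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Site,flush.dist", "1,11.5", "2,12.4"), tmp)

via_csv <- read.csv(tmp)

# The same call spelled out with read.table(), using read.csv()'s defaults
via_table <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                        dec = ".", fill = TRUE, comment.char = "")
identical(via_csv, via_table)
## [1] TRUE

# Changing sep lets read.table() handle other layouts, e.g. tab-delimited
tab_file <- tempfile(fileext = ".txt")
writeLines(c("Site\tflush.dist", "1\t11.5", "2\t12.4"), tab_file)
via_tab <- read.table(tab_file, header = TRUE, sep = "\t")
identical(via_csv, via_tab)
## [1] TRUE
```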

TIP: Dealing with missing data The .csv file format is usually the most robust for reading data into R. Where you have missing data (blanks), the .csv format separates these by commas. However, there can be problems with blanks if you read in a space-delimited format file. If you are having trouble reading in missing data as blanks, try replacing them in your spreadsheet with NA, the missing data code in R. In Excel, highlight the area of the spreadsheet that includes all the cells you need to fill with NA. Do an Edit/Replace… and leave the “Find what:” text box blank and in the “Replace with:” textbox enter NA, the missing value code. Once imported into R, the NA values will be recognised as missing data.
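An alternative to editing the spreadsheet is to tell R which strings should count as missing, using the na.strings argument of read.csv. The sketch below uses a small, made-up file in which some cells are blank and one already contains NA.

```r
# Hypothetical file: row 1 has a blank Sex, row 2 a blank land.dist,
# and row 3 has Sex recorded as NA
tmp <- tempfile(fileext = ".csv")
writeLines(c("Species,Sex,land.dist",
             "Gull,,131.2",
             "Gull,Male,",
             "Gull,NA,154.1"), tmp)

# Default behaviour: blanks in a numeric column become NA,
# but a blank in a text column is kept as an empty string
default <- read.csv(tmp)
is.na(default$land.dist)   # FALSE TRUE FALSE
is.na(default$Sex)         # FALSE FALSE TRUE

# Treat both blanks and "NA" as missing, in every column
cleaned <- read.csv(tmp, na.strings = c("", "NA"))
is.na(cleaned$Sex)         # TRUE FALSE TRUE
```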

TIP: Choosing another file location If you would like to use a dialogue box to specify a file location, then you can use: dat <- read.csv(file.choose(), header = TRUE)

We can see that we have an object called dat. We can find out what sort of object dat is by typing:

class(dat)
## [1] "data.frame"

In this case, dat is a data.frame.

Checking your data

Once the data are in R, you need to check there are no glaring errors. It is useful to call up the first few lines of the data.frame using the function head. Try it yourself by typing:

head(dat)
##   Site       Species    Sex flush.dist land.dist
## 1    1 Oystercatcher   <NA>       11.5     131.2
## 2    1 Oystercatcher   Male       12.4     154.1
## 3    1 Oystercatcher   Male       11.5     147.8
## 4    1 Oystercatcher Female       13.6     162.5
## 5    1 Oystercatcher   Male       12.0     143.7
## 6    1 Oystercatcher   Male       20.2     139.0

This lists the first six rows of the data.frame as a table. You can similarly retrieve the last six rows of a data.frame by an identical call to the function tail. Of course, this works better when you have fewer than 10 or so variables (columns); for larger data sets, things can get a little messy. If you want more or fewer rows in your head or tail, tell R how many you want by adding this information to your function call. Try typing:

head(dat, n = 3)
##   Site       Species  Sex flush.dist land.dist
## 1    1 Oystercatcher <NA>       11.5     131.2
## 2    1 Oystercatcher Male       12.4     154.1
## 3    1 Oystercatcher Male       11.5     147.8

Or even more simply (because n is the second argument of head, R will match an unnamed value in that position to it):

head(dat, 3)
##   Site       Species  Sex flush.dist land.dist
## 1    1 Oystercatcher <NA>       11.5     131.2
## 2    1 Oystercatcher Male       12.4     154.1
## 3    1 Oystercatcher Male       11.5     147.8
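The same approach works for tail, and a negative value of n drops rows rather than selecting them. Here is a small sketch using a made-up data.frame with ten rows (called toy, standing in for dat) so that the results are easy to check by eye.

```r
# Hypothetical data.frame with ten rows, standing in for dat
toy <- data.frame(id = 1:10, flush.dist = seq(10, 19))

# The last three rows (ids 8, 9 and 10)
tail(toy, 3)

# A negative n means "all rows except": here, everything but the last two
head(toy, -2)
nrow(head(toy, -2))
## [1] 8
```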

You can also check the structure of your data by typing:

str(dat)
## 'data.frame':    483 obs. of  5 variables:
##  $ Site      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Species   : Factor w/ 4 levels "Gull","Oystercatcher",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Sex       : Factor w/ 2 levels "Female","Male": NA 2 2 1 2 2 2 2 2 2 ...
##  $ flush.dist: num  11.5 12.4 11.5 13.6 12 20.2 13.5 8.7 12 13.6 ...
##  $ land.dist : num  131 154 148 162 144 ...

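This very handy function lists the variables in your data.frame by name, tells you what sorts of data are contained in each variable (e.g., numeric = num, integer = int, factor = Factor) and provides an indication of the actual contents of each. Note that since R 4.0.0, read.csv reads text columns as character (chr) by default, so Species and Sex will be listed as chr rather than Factor unless you add stringsAsFactors = TRUE to the call.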
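Whether text columns arrive as factors depends on your version of R: since R 4.0.0, read.csv reads them as character unless told otherwise. The sketch below shows two ways of getting factors regardless of version, using a small made-up file rather than the course data.

```r
# A small, hypothetical file with two text columns
tmp <- tempfile(fileext = ".csv")
writeLines(c("Species,Sex",
             "Gull,Male",
             "Oystercatcher,Female",
             "Gull,Female"), tmp)

# Option 1: ask for factors explicitly when reading the file
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
class(as_factor$Species)
## [1] "factor"
levels(as_factor$Species)   # "Gull" "Oystercatcher"

# Option 2: read text as character, then convert chosen columns
as_text <- read.csv(tmp, stringsAsFactors = FALSE)
class(as_text$Sex)
## [1] "character"
as_text$Sex <- factor(as_text$Sex)
levels(as_text$Sex)         # "Female" "Male"
```

Option 2 gives you more control, because you decide column by column which variables should be treated as categories.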

If we wanted only the names of the variables in the data.frame, we could use:

names(dat)
## [1] "Site"       "Species"    "Sex"        "flush.dist" "land.dist"

Now let’s have a look at the data. You will need to do this in every script you write. If we want to refer to a variable, we specify the data.frame name then a $ sign (meaning within an object), and then the variable name. In your script, type:

dat$Sex
dat$flush.dist

This will show you what the data for Sex and for flush.dist look like.
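Because a variable extracted with $ is just a vector, you can pass it straight to other functions. The sketch below uses a miniature data.frame (called toy) built from the first four rows shown by head(dat) above, so that it runs without the course files; table and is.na are base R functions not yet covered in these notes.

```r
# Miniature stand-in for dat, using the first four rows shown earlier
toy <- data.frame(
  Sex        = c(NA, "Male", "Male", "Female"),
  flush.dist = c(11.5, 12.4, 11.5, 13.6)
)

# A column pulled out with $ is an ordinary vector
mean(toy$flush.dist)
## [1] 12.25

# Count the birds of each sex, keeping NA as its own category
table(toy$Sex, useNA = "ifany")

# Count the missing values in a column
sum(is.na(toy$Sex))
## [1] 1
```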

Summary of functions and symbols we learned about this Week

Function | What it does | How to use it
c | Combines elements into an R object called a vector | c(1, 5, "apples")
help or ? | Asks R for help with a specified function | help(mean) or ?mean
# | Tells R not to try to execute the rest of the line; used to annotate scripts | # This is an annotation!
getwd | Asks R what the current working directory is | getwd()
setwd | Tells R to set the working directory to a particular folder | setwd("/RCourseFiles")
read.csv | Allows R to read in comma-separated values (.csv) files | read.csv("BeachBirds.csv")
file.choose | Allows you to use the GUI to pick a file | read.csv(file.choose())
class | Asks R to tell you what class an R object belongs to | class(dat)
head/tail | Prints the first (or last) 6 rows (items) of an R object | head(dat) / tail(dat)
str | Prints out the structure of an R object | str(dat)
names | Asks R what named variables might exist within an R object | names(dat)
$ | The way we identify variables within a data.frame | dat$Site